Abstract

In this paper we introduce two procedures for variable selection
in cluster analysis and classification rules. One is mainly oriented to
detecting the "noisy" non-informative variables, while the other also
deals with multicollinearity. A forward-backward algorithm is also
proposed to make these procedures feasible for large data sets. A small
simulation study is performed and some real data examples are analyzed.
1 Introduction

In multivariate analysis there are several statistical procedures whose
output is a partition of the space. Typical examples are cluster
analysis and classification rules. In cluster analysis (or unsupervised
classification) we look for a partition of the space into homogeneous
groups or clusters (with small dispersion within groups) that helps us to
understand the structure of the data. Several cluster methods have been
proposed, such as hierarchical clustering (Hartigan, 1975), k-means
(MacQueen, 1967), k-medoids (Kaufman and Rousseeuw, 1987) and kurtosis
based clustering (Peña and Prieto, 2001). Most of them produce a partition
of the space into disjoint subsets.

Pattern recognition or classification is about guessing or predicting the
unknown nature of an observation, a discrete quantity such as black or white,
one or zero, sick or healthy. An observation is a collection of numerical
measurements such as an image (which is a sequence of bits, one per pixel),
a vector of weather data, or an electrocardiogram. In classification we
have, in addition, a training sample for each group: together with each
observation of the random vector of variables we know a label that
indicates the subpopulation to which it belongs. A classifier is then any
map that gives, for each new observation, our guess of its class given its
associated vector. The map produces a classification rule, which is also a
partition of the space: a new observation is assigned to the class of the
subset of the partition in which it falls. There is also an extensive
literature on classification rules, such as Fisher's linear discrimination
(Fisher, 1936), nearest neighbor rules (Fix and Hodges, 1951), regression
trees-CART (Breiman et al., 1984), or reduced kernel discriminant analysis
(Hernandez and Velilla, 2005).

A general problem in clustering or classification is to find structure in a
high dimensional variable space with small data sets. In many practical
cases the number of variables (which should not be confused with the amount
of information) is too high. This may be due to the presence of several
"noisy" non-informative variables, and/or redundant information from
strongly correlated variables that may produce multicollinearity. In that
case the information contained in the data set could be extracted from a
reduced subset of the original variables.

A difficult task is to find out which variables are "important", where the
concept of "important" should be related to the statistical procedure we are
dealing with. If we are interested in cluster analysis, we would like to find
the variables that explain the groups we have found. In this way, a (small)
subset of variables should "explain" as well as possible the statistical
procedure in the original (high dimensional) space. Dimension reduction
techniques (like principal component analysis) produce linear combinations
of the variables which are difficult to interpret unless most of the
coefficients of the linear combination are negligible. The variable
selection method of Fowlkes et al. (1988) shifts the problem to a reduced
variable space and looks for new clusters with fewer variables. Tadesse et
al. (2005) propose a Bayesian approach for simultaneously selecting
variables and identifying cluster structures without knowing the number of
clusters. The Bayesian model with latent variables is very useful in cluster
analysis since it produces the most complete output: number of clusters,
data allocation and informative variables. To fit the model it is necessary
to use MCMC methods, in particular Metropolis-Hastings with reversible jump
(Green, 1995), which adds considerable complexity for users who are not
familiar with computer programming.

In this paper we propose consistent statistical methods for variable
selection that are easy to use. The variables that best explain the
procedure in the original space help us to better understand the cluster
output and, as a by-product, we obtain a dimension reduction procedure that
can be used on a new data set for the same problem. We consider two
different proposals based on the idea of "blinding" unnecessary variables.
To cancel the effect of one variable, we replace all the values of that
variable by its marginal mean in the first proposal and by its conditional
mean in the second proposal. The marginal mean approach is mainly oriented
to detecting the "noisy" non-informative variables, while the conditional
mean approach also deals with multicollinearity. The first one is simpler
and does not require as large a sample size as the second one. In practice,
we will also need an algorithm to solve the optimization problem.

In Section 2 we define in precise terms what we mean by a subset of
variables that explains a multivariate partition procedure. Next we define
our objective function and provide a strongly consistent estimate of the
optimal subset. A small simulation study is also performed. In Section 3 we
introduce the proposal based on the conditional mean and show its
performance on a simulated data set. In Section 4 we describe a
forward-backward selection algorithm that looks for the smallest subset that
explains a fixed percentage of the data allocations to the clusters.
Section 5 is devoted to the analysis of two real data examples with medium
and large dimensional variable spaces. Section 6 includes some final
remarks, and the proofs are given in the Appendix.



2 Dropping out noisy non-informative variables

Let $X = (X_1, \ldots, X_p)$ be a random vector with distribution $P$. We
consider any statistical procedure whose output is a partition of the space
$\mathbb{R}^p$. For instance, this is the case of the population target for
most clustering methods or classification rules. To fix ideas we will
concentrate on cluster methods. For a fixed number of clusters $K$, we have
a function

$$f : \mathbb{R}^p \to \{1, \ldots, K\},$$

which determines to which cluster each single point belongs. We denote the
space partition by $G_k = f^{-1}(k)$, $k = 1, \ldots, K$, that satisfies

$$\bigcup_{k=1}^{K} G_k = \mathbb{R}^p, \qquad G_j \cap G_k = \emptyset \ \text{ for } j \neq k.$$

For instance, if we consider $k$-means (with $K = 2$), and
$c_1, c_2 \in \mathbb{R}^p$ are the cluster centers, i.e. the values that
minimize

$$E\left(\min\left(\|X - c_1\|^2, \|X - c_2\|^2\right)\right),$$

the set $G_1$ is given by
$G_1 = \{x \in \mathbb{R}^p : \|x - c_1\| \le \|x - c_2\|\}$, while
$G_2 = G_1^c$.
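
As an illustration of the allocation function $f$ in the $k$-means example,
the following Python sketch assigns each point to its nearest center. The
function name `allocate`, the use of NumPy and the tie-breaking convention
(ties go to the lowest label) are assumptions made here for illustration;
they are not taken from the original text.

```python
import numpy as np


def allocate(x, centers):
    """Return the 1-based label of the nearest center for each row of x.

    Hypothetical helper: x is one point or an (n, p) array of points,
    centers is a (K, p) array of cluster centers.
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    centers = np.asarray(centers, dtype=float)
    # Squared Euclidean distance from every point to every center: (n, K).
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # argmin returns the first minimum, so ties go to the lowest label.
    return d2.argmin(axis=1) + 1
```

With two centers, the points labelled 1 form the set $G_1$ and the points
labelled 2 form its complement.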

If $p$ is large, typically some of the components of the vector $X$ are
strongly correlated or might be almost irrelevant for the cluster procedure.
Then, if the information from the noisy variables is removed from our data,
we should expect that the cluster allocations do not change. This means that
the data are kept in the same group as in the original partition. The key
point is to notice that the partition is defined in the original
$p$-dimensional space and the input data require information from all the
variables, including the noisy ones. We propose to look for the subset of
indices $I \subset \{1, \ldots, p\}$ for which the original partition rule
applied to a new "less informative" vector $Y^I \in \mathbb{R}^p$ built up
from $X$ behaves as closely as possible to the procedure applied to the
"full information" vector $X$. The vector $Y^I$ contains the variables from
$X$ that are indexed by $I$, and the rest of the variables, with index
outside the set $I$, are "blinded". A noisy variable is one whose
probability distribution is almost the same in all the clusters. This
suggests replacing the information in the "blinded" variables by their mean
value.

The percentage of cluster allocations explained by the selected variables
will depend on the problem (the distribution $P$ of $X$) and on how many
variables $d \le p$ we select. In practice, we can choose $d$ in order to
explain at least a fixed percentage, for instance 90%, 95% or 100%, of the
data.

2.1 Population and empirical objective functions 

We now put our purpose in a precise setup. Given a subset of indices

$$I = \{i_1, \ldots, i_d\} \subset \{1, \ldots, p\},$$

we define the vector $Y^I := Y = (Y_1, \ldots, Y_p)$, where $Y_i = X_i$ if
$i \in I$ and $Y_i = E(X_i)$ otherwise. Note that instead of the expectation
$E(X_i)$ we can use the median of $X_i$ or any other location parameter for
the $i$-th coordinate, like M-estimates or trimmed means. The results will
still hold provided we have a strongly consistent estimate of the location
parameter.

For a fixed integer $d \le p$, the population target is the set
$I \subset \{1, \ldots, p\}$, $\#I = d$, for which the population objective
function, given by

$$h(I) = \sum_{k=1}^{K} P\left(f(X) = k, \ f(Y^I) = k\right),$$

attains its maximum.
In this way, we look for the subset $I$ for which the original partition
rule applied to the less informative random vector $Y^I$ behaves as closely
as possible to the procedure applied to the "full information" vector $X$.
All components with index outside the set $I$ are blinded in the sense that
they are constant.

In practice, the empirical version consists of the following steps:

1. Given iid data $X_1, \ldots, X_n \in \mathbb{R}^p$, apply the partition
procedure to the data set and obtain the empirical cluster allocation
function,

$$f_n : \mathbb{R}^p \to \{1, \ldots, K\},$$

where now $f_n(x)$ is data dependent. The associated space partition will
be denoted by $G_k^n = f_n^{-1}(k)$, for $k = 1, \ldots, K$.

2. For a fixed value $d \le p$, given a subset of indices
$I \subset \{1, \ldots, p\}$ with $\#I = d$, define the random vectors
$\{X_j^*, 1 \le j \le n\}$ verifying

$$X_j^*[i] = X_j[i] \ \text{ if } i \in I, \quad \text{and} \quad X_j^*[i] = \bar{X}[i] \ \text{ otherwise},$$

where $X[i]$ stands for the $i$-th coordinate of the vector $X$, and
$\bar{X}[i]$ stands for the $i$-th coordinate of the average vector.

If we have used another location parameter instead of the expected value
in the population version (like the median), we replace the average
by the corresponding empirical version (the sample median).

3. Calculate the empirical objective function

$$h_n(I) = \frac{1}{n} \sum_{k=1}^{K} \sum_{j=1}^{n} \mathbf{1}_{\{f_n(X_j) = k\}} \, \mathbf{1}_{\{f_n(X_j^*) = k\}},$$

where $\mathbf{1}_A$ stands for the indicator function of the set $A$.

4. Look for a subset $I_{d,n} =: I_n$, with $\#I_n = d$, that maximizes the
empirical objective function $h_n$.
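
The four steps above can be sketched in Python as follows. This is a minimal
illustration, not the authors' implementation: the names `blind_with_mean`,
`h_n` and `best_subsets` are hypothetical, `f_n` is assumed to be any
callable that maps an (n, p) array to an array of n cluster labels, and the
search over subsets is the plain exhaustive one.

```python
from itertools import combinations

import numpy as np


def blind_with_mean(X, keep):
    """Step 2: copy X, replacing every column not in `keep` by its mean."""
    X = np.asarray(X, dtype=float)
    keep = list(keep)
    X_star = np.tile(X.mean(axis=0), (X.shape[0], 1))
    X_star[:, keep] = X[:, keep]
    return X_star


def h_n(X, keep, f_n):
    """Step 3: proportion of observations whose label survives blinding."""
    X = np.asarray(X, dtype=float)
    return float(np.mean(f_n(X) == f_n(blind_with_mean(X, keep))))


def best_subsets(X, d, f_n):
    """Step 4: exhaustive search over all index subsets of size d."""
    p = np.asarray(X).shape[1]
    scores = {I: h_n(X, I, f_n) for I in combinations(range(p), d)}
    top = max(scores.values())
    return [I for I, s in scores.items() if s == top], top
```

Summing the product of indicators over $k$ counts the observations whose
allocation is unchanged, which is why `h_n` reduces to a simple proportion of
matching labels. To use the sample median instead of the mean, only the
location estimate in `blind_with_mean` would change.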
2.2 Consistency. Assumptions and main result

As expected, the consistency of our variable selection procedure is linked to
the properties of the cluster partition method. We now give some conditions
under which our procedure is consistent.

Assumption 1:

a) The partition procedure is strongly consistent, i.e., given
$\varepsilon > 0$, there exists a set $A(\varepsilon) \subset \mathbb{R}^p$
with $P(X \in A(\varepsilon)) > 1 - \varepsilon$, such that for all $r > 0$

$$\lim_{n \to \infty} \sup_{x \in C(\varepsilon, r)} \left| \mathbf{1}_{\{f_n(x) = k\}} - \mathbf{1}_{\{f(x) = k\}} \right| = 0 \quad \text{a.s., for } k = 1, \ldots, K,$$

where $C(\varepsilon, r) = A(\varepsilon) \cap B(0, r)$ stands for the
intersection of the set $A(\varepsilon)$ and the closed ball centered at
zero of radius $r$, $B(0, r)$.

b)

$$d(X, \partial G_k^n) - d(X, \partial G_k) \to 0 \quad \text{a.s., for } k = 1, \ldots, K,$$

where $d(X, \partial G_k)$ stands for the distance from $X$ to the boundary
of $G_k$.

Assumption 2:

$$\lim_{\delta \to 0} P\left(d(Y, \partial G_k) < \delta\right) = 0, \quad \text{for } k = 1, \ldots, K.$$

Assumption 1a typically holds for cluster and classification rules, where
the set $A(\varepsilon)$ is the complement of an $\varepsilon$-neighborhood
("outer parallel set") of the partition boundaries, as shown in Figure 1,
i.e.

$$A(\varepsilon)^c = \bigcup_{x \in \partial G_1 \cup \cdots \cup \partial G_K} B(x, \varepsilon),$$

where $B(x, \varepsilon)$ denotes the ball with center $x$ and radius
$\varepsilon$.

Figure 1: Excluding a neighborhood of the partition boundaries, we have
almost sure uniform convergence of the function $f_n$ to $f$ over compact
sets (the colored area $A(\varepsilon)$).

Theorem 1 (Strong Consistency) Let $\{X_j : j \ge 1\}$ be iid random vectors
with distribution $P_X$. Given $d$, $1 \le d \le p$, let $\mathcal{I}_d$ be
the family of all subsets of $\{1, \ldots, p\}$ with cardinal $d$, and
$\mathcal{I}_{d,0} \subset \mathcal{I}_d$ the family of subsets where the
maximum of $h(I)$ is attained, for $I \in \mathcal{I}_d$. Then, under
Assumptions 1 and 2, there exists $n_0 = n_0(\omega)$ such that

$$I_n \in \mathcal{I}_{d,0} \quad \text{for } n \ge n_0(\omega) \ \text{ a.s.}$$

The proof is given in the Appendix.

2.3 Selection of variables in simulated data

In order to analyze the performance of our method, we carry out a Monte
Carlo study with some simulated data sets. In all of them we generate 100
observations in a three dimensional variable space. The underlying
distributions are mixtures of three multivariate normals with mixing weights
$\alpha_1 = \alpha_2 = 0.35$ and $\alpha_3 = 0.30$. The cluster structure is
defined through $X_1$ and $X_2$ and, to simplify, we consider them
independent in all the cases, with distributions given by

$$X_1 \sim \alpha_1 N(0, 0.2) + \alpha_2 N(0.1, 0.2) + \alpha_3 N(0.9, 0.2),$$
$$X_2 \sim \alpha_1 N(0, 0.2) + \alpha_2 N(0.9, 0.2) + \alpha_3 N(0.1, 0.2).$$

For the distribution of $X_3$ we consider two different scenarios.



Figure 2: Scatter plots and histograms from a three dimensional data set
generated following the Case I description with $\sigma = 0.2$.

Case I: $X_3$ is an independent "noise" variable with distribution given by

$$X_3 \sim N(0, \sigma),$$

where $\sigma$ takes the values 0.1, 0.2 and 0.3. Figure 2 shows a simulated
data set from these distributions with $\sigma = 0.2$. The three clusters
are perfectly distinguished when plotting the pairs $(X_1, X_2)$; however,
only two clusters can be seen in the scatter plots that involve $X_3$, as is
also the case in the $X_1$ and $X_2$ histograms.

Case II: $X_3$ is not an independent variable and is given by

$$X_3 = (X_1 + X_2)/\sqrt{2}.$$

In Table 1 we report the proportion of times that the information in only
one, two or three variables is enough to explain all the cluster allocations.
We also consider the effect of relaxing the required efficiency to only
95%, or 90%, of correct allocations. In all the cases we carried out 1,000
replications, following the next steps.

1. Generate $X_1, \ldots, X_{100}$ observations.

2. Split the data into three clusters using the $k$-means algorithm.

3. Search for the optimal subset of variables for 100%, 95% and 90%
efficiencies.
                                        Number of variables
                        Efficiency      1        2        3

$X_3 \sim N(0, \sigma)$

  $\sigma = 0.1$          100%                   0.997    0.003
                           95%          0.005    0.995
                           90%          0.008    0.992

  $\sigma = 0.2$          100%                   0.926    0.074
                           95%          0.003    0.986    0.011
                           90%          0.005    0.994    0.001

  $\sigma = 0.3$          100%                   0.736    0.264
                           95%                   0.976    0.024
                           90%          0.006    0.988    0.006

$X_3 = (X_1 + X_2)/\sqrt{2}$

                          100%                   0.146    0.854
                           95%          0.001    0.970    0.029
                           90%          0.003    0.990    0.007

Table 1: Simulation results from the Monte Carlo study carried out using
the distributions proposed in Cases I and II.

In the first case our variable selection method is very successful and
selects only the two variables $X_1$ and $X_2$ in almost all the
simulations, for 100%, 95% and 90% efficiencies. A different scenario
appears with Case II, where the third variable is a linear combination of
the first two variables. In only 14.6% of the replications does a
two-variable subset explain all the cluster allocations. This changes
dramatically when we allow for 5% or 10% of misclassified observations: now
at least 97% of the times the method selects only two variables, instead of
three.

Case II shows an interesting feature of the variable selection procedure:
it is able to eliminate noise variables, but it is unable to detect
redundant information from collinear variables. This effect can be seen more
clearly with the simulated example proposed by Tadesse, Sha and Vannucci
(2005). The data consist of the 15 three-dimensional observations displayed
in Figure 3. The first four observations come from independent normals with
mean $\mu_1 = 5$ and variance $\sigma_1^2 = 1.5$. The next three come from
independent normals with mean $\mu_2 = 2$ and variance $\sigma_2^2 = 0.1$.
The following six come from independent normals with mean $\mu_3 = -3$ and
variance $\sigma_3^2 = 0.5$, while the last two come from independent
normals with mean $\mu_4 = -6$ and variance $\sigma_4^2 = 2$. Although in
Tadesse et al. (2005) the data set was generated with twenty-dimensional
observations instead of three-dimensional ones, we call this data set TSV05.

Figure 3: a) Dots are the simulated data TSV05 as in Tadesse et al. (2005)
and stars are the four $k$-means centers; b) results of blinding the
vertical coordinate with the mean value.

We first ran the $k$-means algorithm with $k = 4$, which classified the
whole data set correctly. Then we ran the variable selection procedure based
on the mean value (dropping out noisy non-informative variables). A closer
look at this data generating mechanism indicates that one should expect to
attain 100% efficiency with only one variable, since we have the same
cluster structure in the three coordinates. However, the procedure was
unable to find the cluster structure when blinding all variables except one.
The efficiencies in Table 2 show that only the subset with the three
variables classifies all the data in their original clusters. This result is
expected since all the variables contain information about the clusters;
they are not noisy variables. However, as in Case II, these collinear
variables are redundant and it would be interesting to develop a variable
selection method able to detect them.

Subset                  Efficiency

$X_1$                   60%
$X_2$                   60%
$X_3$                   60%
$X_1, X_2$              66.66%
$X_1, X_3$              73.33%
$X_2, X_3$              86.66%
$X_1, X_2, X_3$         100%

Table 2: Percentage of correct allocations in the TSV05 data set using the
variable selection method based on the mean.

Figure 3b helps us to understand the main problem that appears when we
blind one variable by replacing all the data by the mean value. We observe
the case of blinding the vertical coordinate, which amounts to projecting
all the data onto the shaded mean plane. As the mean is not a representative
value for data generated from a cluster structure, the allocations to the
clusters will be arbitrary. For instance, we point out the correct center
for one projected observation with a dashed arrow; however, in this case the
closest center is a different one. This observation is wrongly allocated by
the variable selection method. Remember that we blind the variable but not
the corresponding coordinate of the $k$-means centers. Hence, to eliminate
not only noisy variables but also collinear variables, the idea is to blind
with local information instead of using the mean. This would not be a
problem for noisy variables, and we will see in the next section that it is
crucial for multicollinearity.



3 Dealing with multicollinearity

The previous procedure is mainly designed to find "noisy" non-informative
variables; however, as the simulated data set highlighted, it may fail in
the presence of collinearity. In order to deal with this problem, we
consider a quite natural extension, changing the definition of the "less
informative" vector $Y^I$. Recall that we defined $Y_i^I = X_i$ if
$i \in I$, and $Y_i^I = E(X_i)$ otherwise. Thus, for indices in the
complement of the set $I$, $Y_i^I$ is defined as the best constant
predictor. Now the idea appears clearly: replace means by conditional means.
We define the less informative vector $Z^I$ for indices $i$ in the
complement of the set $I$ as the conditional expectation of $X_i$ given the
set of variables $\{X_l : l \in I\}$, i.e. the best predictor of $X_i$ based
on those variables. This procedure is able to deal with both kinds of
problems. However, at first sight, a shortcoming is that it requires a large
sample size in order to estimate the conditional expectation. The
computational effort is also considerably greater. The choice of the
smoothing parameter is challenging as well, since it must involve no more
data than the size of the smallest cluster (if we think for instance of
local averages). If $m_n$ is the size of the smallest group for the
partition procedure and, for each $d$ and $n$, $r = r(n, d)$ is the number
of nearest neighbors, we need to require that $r \le m_n$, together with the
standard conditions $r/n \to 0$ and $n (r/n)^d \to \infty$ as
$n \to \infty$. We now briefly describe the proposal in a precise setup.

3.1 Population and empirical objective function

Given a subset of indices

$$I = \{i_1, \ldots, i_d\} \subset \{1, \ldots, p\},$$

let

$$X[I] =: (X_{i_1}, \ldots, X_{i_d}), \quad \text{for } i_1 < i_2 < \cdots < i_d.$$

We define the vector $Z^I := Z = (Z_1, \ldots, Z_p)$, where $Z_i = X_i$ if
$i \in I$ and $Z_i = E(X_i \mid X[I])$ otherwise. Instead of the conditional
expectation $E(X_i \mid X[I])$, in order to attain robustness we can use
local medians or local M-estimates (see for instance Stone, 1977, Truong,
1989 or Boente and Fraiman, 1995).

For a fixed integer $d \le p$, the population target is now the set
$I \subset \{1, \ldots, p\}$, $\#I = d$, for which the function

$$h(I) = \sum_{k=1}^{K} P\left(f(X) = k, \ f(Z^I) = k\right),$$

attains its maximum.

In practice, the empirical version consists of the same steps as in the
method based on the mean, except the second, which is replaced by the
following step:

2'. For a fixed value of $d \le p$, given a subset of indices
$I \subset \{1, \ldots, p\}$ with $\#I = d$, fix an integer value $r$ (the
number of nearest neighbors to be used). For each $j = 1, \ldots, n$, find
the set of indices $C_j$ of the $r$ nearest neighbors of $X_j[I]$ among
$\{X_1[I], \ldots, X_n[I]\}$.

Now define the random vectors $\{X_j^*, 1 \le j \le n\}$ verifying

$$X_j^*[i] = X_j[i] \ \text{ if } i \in I, \quad \text{and} \quad X_j^*[i] = \frac{1}{r} \sum_{m \in C_j} X_m[i] \ \text{ otherwise},$$

where $X[i]$ stands for the $i$-th coordinate of the vector $X$.

A resistant procedure would take the local median instead of the local
mean for $i \notin I$, i.e.
$X_j^*[i] = \mathrm{median}(\{X_m[i] : m \in C_j\})$.
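
Step 2' can be sketched in Python as follows. The function name
`blind_with_knn` and its arguments are hypothetical, and the brute-force
distance computation is an assumption chosen for brevity. Following the
description above, the neighbors of $X_j[I]$ are searched among all
$X_1[I], \ldots, X_n[I]$, so each observation counts as its own nearest
neighbor.

```python
import numpy as np


def blind_with_knn(X, keep, r, use_median=False):
    """Replace each column not in `keep` by a local mean (or median).

    The local average for observation j is taken over the r observations
    that are nearest to it in the retained coordinates X[:, keep].
    """
    X = np.asarray(X, dtype=float)
    keep = list(keep)
    XI = X[:, keep]
    # Pairwise squared distances in the retained coordinates: (n, n).
    d2 = ((XI[:, None, :] - XI[None, :, :]) ** 2).sum(axis=2)
    # Row j holds the index set C_j (observation j itself is included).
    nn = np.argsort(d2, axis=1, kind="stable")[:, :r]
    # X[nn] has shape (n, r, p); aggregate over the r neighbours.
    if use_median:
        X_star = np.median(X[nn], axis=1)
    else:
        X_star = X[nn].mean(axis=1)
    # Retained coordinates keep their observed values.
    X_star[:, keep] = X[:, keep]
    return X_star
```

Setting `use_median=True` gives the resistant variant based on the local
median. The brute-force search needs memory of order $n^2$, which is only
reasonable for moderate sample sizes.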

3.2 Consistency. Assumptions and main result 

As we have seen before, the consistency of the variable selection method
relies on the properties of the cluster partition method. Moreover,
regularity conditions on the boundary of the partitioning sets are required
in order to carry out the nonparametric regression. We now give the
conditions under which our procedure is consistent. Together with
Assumption 1 we will need:

Assumption 3:

$$\lim_{\delta \to 0} P\left(d(Z, \partial G_k) < \delta\right) = 0, \quad \text{for all } k = 1, \ldots, K.$$

Assumption 4:

$$\sup_x \left| g_{i,n}(x) - g_i(x) \right| \to 0 \quad \text{a.s., for } i \notin I,$$

where $g_i(x) = E(X_i \mid X[I] = x)$ is the corresponding nonparametric
regression function and $g_{i,n}(x)$ is a consistent estimate of $g_i(x)$.

Assumption 4 allows the use of any uniformly consistent estimate of the
regression function, although we have only described above the case of
$r$-nearest neighbor estimates.

Theorem 2 (Strong Consistency) Let $\{X_j : j \ge 1\}$ be iid random vectors
with distribution $P_X$. Given $d$, $1 \le d \le p$, let $\mathcal{I}_d$ be
the family of all subsets of $\{1, \ldots, p\}$ with cardinal $d$, and
$\mathcal{I}_{d,0} \subset \mathcal{I}_d$ the family of subsets where the
maximum of $h(I)$ is attained, for $I \in \mathcal{I}_d$. Then, under
Assumptions 1, 3 and 4, there exists $n_0 = n_0(\omega)$ such that

$$I_n \in \mathcal{I}_{d,0} \quad \text{for } n \ge n_0(\omega) \ \text{ a.s.}$$

The proof of Theorem 2 is very similar to that of Theorem 1, so we omit the
details and only point out the differences in the Appendix.
3.3 TSV05 example revisited

When we apply the conditional mean variable selection method to the 15
three-dimensional observations of TSV05, we now attain 100% efficiency with
only one variable. This is the case for the second or the third variable,
while with the first one we obtain 93.3% efficiency.

A slightly different version of this example includes three new variables.
The additional noisy coordinates are generated from independent standard
normal distributions. The two variable selection procedures applied to the
six-dimensional data set produce exactly the same results as for the
three-dimensional data. The noisy variables are not necessary to reach 100%
efficiency.

4 A forward-backward algorithm

A well known feature of the variable selection problem is the large number
of subsets that must be considered even for moderate values of $p$. An
exhaustive search is guaranteed to find the smallest subset of variables
that achieves at least a fixed percentage on the empirical objective
function; however, this procedure is infeasible when many variables are
considered. For instance, if $p = 50$ we should check more than $10^{15}$
combinations. Alternatively, we propose a computationally less expensive
forward-backward search algorithm. We run the search mainly in the forward
mode and include the last step in the backward mode.
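
To see where a figure of this order comes from, one can count the non-empty
subsets of $p = 50$ variables. This is a worked check added for
illustration, assuming that every non-empty subset is a candidate:

```latex
\sum_{d=1}^{50} \binom{50}{d} = 2^{50} - 1
  = 1\,125\,899\,906\,842\,623 \approx 1.13 \times 10^{15}.
```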

The algorithm starts from a one-variable set and progressively includes new
variables, with an iterative revision of the inclusions in each step. In
general, the backward search is less costly, but the leave-one-out strategy
makes it difficult to find a small subset. When a set provides a percentage
of good classifications above the fixed percentage, the backward process
starts the search for a more parsimonious solution. To compute the objective
function we can blind the variables by replacing them either by the mean or
by the conditional mean (in this case, conditioning on the subset chosen up
to that step). The conditional mean is estimated by nearest neighbors.

We distinguish three parts in the algorithm design, which are implemented
sequentially.

Part 1: Select the most "influential" variable $X^{(1)}$ (the one whose
absence affects the data allocation the most), blinding all the variables
one by one and selecting the one with the minimum value of the objective
function,

$$X^{(1)} = \arg\min_{1 \le j \le p} h_n(I_j),$$

where $I_j$ denotes all the variables except the $j$-th, which is blinded.

Part 2: Sequential increment of variables one by one (forward search). In
each step we look for the accompanying variable such that the new subset
maximizes the number of successful data allocations. We also consider
replacement of previously introduced variables following the iterative
scheme described by Miller (1984) as a variation of the classical
forward-backward methods. The increment continues until the fixed
percentage of well classified data is reached.

Part 3: The subset is revised for unnecessary variables (backward search).
The previously introduced variables are checked one by one to see whether
they are necessary. The algorithm stops when no further reduction can be
found without loss of efficiency.

This procedure strongly depends on the order of the variables in the vector.
To avoid (or minimize) this label effect we run the algorithm for a random
sample of permutations of the variables. We finally select the solutions
that use the minimum number of variables.
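
The three parts can be sketched in Python as follows. This is a simplified
illustration under stated assumptions, not the authors' MATLAB code: the
function name `forward_backward` is hypothetical, `objective` is assumed to
be any callable that maps a tuple of variable indices to the value of
$h_n$, and the replacement scheme of Miller (1984) and the random
permutations of the variables are left out.

```python
def forward_backward(p, objective, target):
    """Greedy search for a small index set I with objective(I) >= target."""
    everything = tuple(range(p))

    def score(indices):
        return objective(tuple(sorted(indices)))

    # Part 1: the most influential variable is the one whose blinding
    # gives the lowest objective value when all others are kept.
    first = min(everything,
                key=lambda j: score([i for i in everything if i != j]))
    selected = [first]

    # Part 2 (forward): add the best accompanying variable until the
    # target percentage is reached or no variable is left.
    while score(selected) < target and len(selected) < p:
        rest = [j for j in everything if j not in selected]
        selected.append(max(rest, key=lambda j: score(selected + [j])))

    # Part 3 (backward): drop variables that are not needed to stay
    # at or above the target.
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for j in list(selected):
            trial = [i for i in selected if i != j]
            if score(trial) >= target:
                selected = trial
                improved = True
                break
    return tuple(sorted(selected))
```

Because `min` and `max` return the first optimum they meet, the result can
depend on the order of the variables, which is the label effect that the
random permutations are meant to reduce.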

In the following section we illustrate the algorithm's performance on real
data examples. The MATLAB codes are available from the authors upon request.
5 Real Data Examples



5.1 Evaluation of educational programs 

We have survey data concerning education quality from 98 schools in the
city and suburbs of Buenos Aires (Argentina). The survey and subsequent data
analysis were carried out by Llach et al. (2006). An important objective of
this study was to find homogeneous groups of schools and to characterize the
clusters. The variable selection method is a powerful tool to separate the
variables with real influence from those that are non-informative.

At each school, a questionnaire with fifteen items was completed by the
headmaster and the teachers. The questions concern the human and didactic
resources, the relationships between all the agents involved and the
physical condition of the building. All the answers range on a discrete
scale between 1 and 100. The items $V_1$ to $V_8$ are answered by the
headmasters and refer to their experience, aptitude, school general
knowledge, evaluation of the building conservation, evaluation of the
didactic resources, and relationships with teachers, parents and students.
The items $V_9$ to $V_{15}$ are answered by the teachers and the questions
are the same as $V_1$ to $V_8$, except $V_3$ (school general knowledge),
which is only answered by the headmasters.

In Llach et al. (2006), a $k$-means cluster procedure was performed using
the 98 fifteen-dimensional vectors. The data were split into three clusters
of sizes 45, 21 and 19 respectively.

The relationship between the clusters and the mean scores in a general
knowledge exam (GKE) and the mean socioeconomic level of the students
(SEL) is shown in Table 3. Both the GKE results and the SEL are
significantly different among clusters, with ANOVA $p$-values $\ll 0.0001$.
The clusters with a higher mean level of student knowledge are also those
with a higher mean socioeconomic level. The question now is which variables
carry relevant information to establish this school grouping.

                  GKE                 SEL
              Mean     Std        Mean     Std

Cluster 1     49.25    9.18       49.51    9.93
Cluster 2     58.04   13.01       63.49   16.39
Cluster 3     64.60   10.94       68.80   11.79

Table 3: Means and standard deviations for the student general knowledge
exams (GKE) and the student socioeconomic levels (SEL).

We select the variables that determine the clusters according to the first
proposal, with an exhaustive search (which is possible because of the
moderate dimension of the data). The clusters are completely explained (100%
efficiency) by $V_3, V_4, V_7, V_8, V_{11}, V_{12}, V_{14}, V_{15}$, that
is, the headmaster's school general knowledge, evaluation of the building
conservation and relationships with parents and students, and the teacher's
evaluation of the building conservation, evaluation of the didactic
resources and relationships with parents and students.

As the requested efficiency decreases, so does the number of variables: the
subset includes the variables $V_1, V_4, V_7, V_{11}, V_{12}, V_{14}$ for
98% efficiency. For 95% efficiency several subsets of size six were found.
To achieve 92% efficiency, we found two optimal subsets with only four
variables, $V_4, V_7, V_{11}, V_{14}$ or $V_4, V_7, V_{12}, V_{14}$. Both
contain the headmaster's evaluation of the building conservation and the
relationships between headmasters and parents and between teachers and
parents. The variable that differs is the teacher's evaluation of the
building conservation or the teacher's evaluation of the didactic resources.
In all the cases the subsets contain information from the headmasters and
the teachers.

In order to study the algorithm's performance we ran it with 100
permutations, and the results were consistent with the exact procedure. We
found almost all the subsets that were found before.

To refine the previous results we apply the conditional procedure to detect
collinearity. With either the exact procedure or the algorithm we found the
same subsets: the variables
$V_2, V_3, V_4, V_7, V_8, V_{11}, V_{12}, V_{14}$ or
$V_3, V_4, V_7, V_8, V_9, V_{11}, V_{12}, V_{14}$ reach 100% efficiency. For
97% efficiency the variables found are $V_3, V_4, V_7, V_{11}, V_{14}$, and
only three variables, $V_4, V_7, V_{14}$, are required to explain 91% of the
cluster allocations. These final variables are the headmaster's evaluation
of the building conservation, and the relationships between headmasters and
parents and between teachers and parents.

Throughout the data analysis we observe the importance of the relationships
with the parents, both for headmasters and for teachers. When we look for
non-noisy variables we find that the relationships with students also carry
relevant information about the cluster origin. However, these variables are
eliminated from the final subset when we use the conditional mean, which
means that the opinion about the relationships with the students contains
redundant information.

5.2 Identifying types of electric power consumers with 
functional data 

We consider the same example presented in Cuesta-Albertos and Fraiman
(2006), where an impartial-trimming cluster procedure is proposed for
functional data. The study aimed to find behavioral patterns of the electric
power home-consumers in the City of Buenos Aires. For each home,
measurements were taken every 15 minutes during all the weekdays of January
2001. The analyzed data were the vectors of dimension 96 with the monthly
averages for a sample of 101 home-consumers. The data were normalized
in such a way that the maximum of each curve was equal to one.
Cuesta-Albertos and Fraiman (2006) found a two-cluster structure, apart from
13 outliers. The resulting trimmed 2-mean functions (cluster centers) are
shown in Figure 4. The non-trimmed functions were then assigned to the
closest center; with this criterion the first cluster is composed of 33
home-consumers and the second one of 55. The remaining 13 observations are
considered outliers.
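
The two preprocessing steps just described, scaling each curve to a maximum of one and assigning each curve to its closest center, can be sketched as follows. This is only an illustration under assumptions: the distance is taken to be Euclidean on the discretized curves, the numbers are made up, and the function names `normalize_curves` and `assign_to_closest` are hypothetical.

```python
import numpy as np


def normalize_curves(curves):
    """Rescale each curve (row) so that its maximum equals one.
    Assumes every curve has a strictly positive maximum."""
    return curves / curves.max(axis=1, keepdims=True)


def assign_to_closest(curves, centers):
    """Index of the closest center for each curve (Euclidean distance on the
    discretized curves, an assumption made for this sketch)."""
    d2 = ((curves[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


# Made-up curves observed at four time points (the real data have 96).
raw = np.array([[1.0, 2.0, 4.0, 2.0],
                [3.0, 1.0, 1.0, 1.0],
                [2.0, 4.0, 8.0, 4.0]])
curves = normalize_curves(raw)

# Made-up cluster centers on the normalized scale.
centers = np.array([[0.2, 0.5, 1.0, 0.5],
                    [1.0, 0.3, 0.3, 0.3]])
labels = assign_to_closest(curves, centers)
```

After normalization the first and third curves coincide, since they differ only by a scale factor, which is the purpose of the normalization: curves are compared by shape rather than by level of consumption.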

In this example, the set of variables includes all the electricity
consumptions in the 15-minute time intervals of a day, that is, 96
variables. This number is too large to compute the exact objective functions
over all possible subsets. Therefore, to find the most relevant
"window-times" for the cluster procedure we need to run the
forward-backward search algorithm. We apply both the mean and the
conditional mean variable selection algorithms for 90%, 95% and 100%
efficiency. For the calculation of the conditional mean, we consider 5, 10
and 33 nearest neighbors (NN). The results after 100 permutations are
summarized in Table 4.

[Figure 4: two panels, Cluster 1 and Cluster 2; the horizontal axis is the
time of day, with ticks at 06:00, 12:00 and 18:00.]

Figure 4: Home-consumers and cluster centers (α-2-mean functions with
trimming proportion α = 13/101) of the two-cluster structure for the
electricity consumption functional data.

Using the conditional mean algorithm instead of the faster mean algorithm
reduces, in all cases, the number of time intervals that provide
enough information to characterize the two types of electric power
home-consumers. The results show that the choice of the number of nearest
neighbors also matters, although the method seems to be less sensitive to it
than non-parametric regression. Its choice nevertheless remains an important
open problem. In our case, the results for 5-NN are quite satisfactory: for
100% efficiency there is only one solution, with 9 variables; for 95%
efficiency we found 15 different solutions with six variables; and for 90%
efficiency we found 5 different solutions with four variables. We choose one
of them to illustrate in Figure 5 the "window-times" (non-shaded areas)
that seem relevant.
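
The role of the number of nearest neighbors can be illustrated with a small sketch. It assumes that the conditional mean of each excluded variable, given the selected ones, is estimated by averaging that variable over the k nearest neighbors in the selected coordinates, with each point counted among its own neighbors; both choices are assumptions of this sketch. The partition rule (nearest center), the simulated data and the names `nearest_center`, `knn_conditional_mean` and `conditional_efficiency` are likewise hypothetical.

```python
import numpy as np


def nearest_center(A, centers):
    """Nearest-center partition rule (assumed for illustration)."""
    d2 = ((A[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def knn_conditional_mean(X, subset, k):
    """Replace each variable outside `subset` by its average over the k
    nearest neighbors, with distances measured on `subset` only.
    Assumption: each point counts among its own neighbors."""
    cols = list(subset)
    S = X[:, cols]
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d2, axis=1, kind="stable")[:, :k]  # (n, k) neighbor indices
    Z = X[nn].mean(axis=1)                             # (n, p) neighbor averages
    Z[:, cols] = X[:, cols]                            # selected variables unchanged
    return Z


def conditional_efficiency(X, centers, subset, k):
    """Fraction of labels preserved after the k-NN conditional-mean replacement."""
    Z = knn_conditional_mean(X, subset, k)
    return float(np.mean(nearest_center(Z, centers) == nearest_center(X, centers)))


# Simulated data: the second variable is (almost) a function of the first,
# and the clusters are separated along the second variable only.
rng = np.random.default_rng(1)
n = 400
x1 = rng.uniform(-2.0, 2.0, n)
x2 = x1 ** 2 + rng.normal(0.0, 0.05, n)
X = np.column_stack([x1, x2])
centers = np.array([[0.0, 1.0], [0.0, 3.0]])

eff_5nn = conditional_efficiency(X, centers, (0,), k=5)
eff_mean = conditional_efficiency(X, centers, (0,), k=n)  # k = n gives the plain mean
```

With k equal to the sample size the neighbor average is the sample mean, so the sketch reduces to the mean procedure. In this toy example the first variable alone looks uninformative under the mean procedure but recovers most allocations under the 5-NN conditional mean, because the second variable is redundant given the first.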

The most informative consumption records are confined to a few
"window-times" (see Figure 6), and the two types of electric power consumers
are mainly characterized by their different behavior at some time intervals
in the morning (7:00 to 11:00), evening (15:00 to 19:00), night (21:00 to
24:00) and early morning (3:00 to 4:00). Comparing the mean and the 5-NN
conditional solutions, we observe that the redundant information, especially
in the evening and at night, is summarized in the smaller subset of
variables found by the conditional mean algorithm. When we reduce the degree
of efficiency and accept a number of misclassifications, the importance of
the early morning behavior diminishes.

NN      Effic.   num. of      Effic.   num. of      Effic.   num. of
                 variables             variables             variables
5       90%      4            95%      6            100%     9
10               3                     5                     16
33               3                     7                     28
Mean    90%      15           95%      22           100%     33

Table 4: Optimal number of variables (time intervals) for different
efficiency percentages and numbers of nearest neighbors, using both the mean
and the conditional mean variable selection algorithms.

Figure 5: Two-mean electricity consumption cluster centers for the
functional data. Non-shaded time intervals correspond to the subset of
variables found by the 5-NN Conditional Mean Algorithm, for different
degrees of efficiency.

6 Final Remarks 

We propose two variable selection procedures particularly designed for
partition rules (typically supervised and un-supervised classification
methods) that help to understand the results for high-dimensional data. Both
methods are strongly consistent. The second procedure, based on conditional
means, is much more flexible and takes into account general dependence
structures within the data. Both proposals performed well in the simulated
and real data examples.

For low or moderate dimensional data an exhaustive search is possible,
even for 100% efficiency. However, it is unfeasible for high-dimensional
data, for which we propose a forward-backward algorithm. We compared the
performance of the algorithm with the exhaustive search in some of the
examples, and the results are very positive, since both provide the same or
almost the same subsets. However, the algorithm demands considerable
computing time, which suggests that additional research should be
considered on this aspect.

Figure 6: Two-mean electricity consumption cluster centers for the
functional data. Non-shaded time intervals correspond to the subset of
variables found by the Mean and the 5-NN Conditional Mean Algorithms, for
different degrees of efficiency.
Appendix

Proof of Theorem 1 

To simplify the proof, we assume that there exists a unique subset
$I_{d,0} =: I_0 = \{i_1, \dots, i_d\} \subset \{1, \dots, p\}$ that maximizes
$h(I)$ for $I \in \mathcal{I}_d$. We then point out, along the lines of the
proof, the differences with the case of more than one subset.

As the optimization is over all the combinations of $d$ of the $p$ variable
indices, a finite number, it suffices to show that

\[
\lim_{n \to \infty} h_n(I) = h(I) \quad \text{a.s., for all } I \in \mathcal{I}_d. \tag{1}
\]

Indeed, since there exists a unique set $I_0 \in \mathcal{I}_d$ that
maximizes $h(I)$, there exists $\eta > 0$ such that

\[
h(I_0) > h(I) + \eta, \quad \text{for all } I \neq I_0,\ I \in \mathcal{I}_d.
\]

We have from (1) that, for all $I \in \mathcal{I}_d$,

\[
|h_n(I) - h(I)| < \frac{\eta}{2}, \quad \text{if } n > n_0(I, \omega),
\]

which entails

\[
h_n(I) < h(I) + \frac{\eta}{2} < h(I_0) - \frac{\eta}{2}, \quad \text{if } I \neq I_0. \tag{2}
\]

Since we also have

\[
h_n(I_0) > h(I_0) - \frac{\eta}{2} > h(I) - \frac{\eta}{2},
\]

we conclude, together with (2), that $I_0$ maximizes $h_n(I)$ if
$n > n_0(\omega)$ a.s.

If there exists more than one subset in $\mathcal{I}_{d,0}$, the argument is
the same, replacing $I_0$ by $\mathcal{I}_{d,0}$.

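Now, it remains to show (1), which reduces to proving that

\[
\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} \mathbb{I}_{\{f_n(X_j) = k\}} \mathbb{I}_{\{f_n(X_j^*) = k\}} = P(f(X) = k, f(Y) = k) \quad \text{a.s.,} \tag{3}
\]

for $k = 1, \dots, K$.

Finally, the equation (3) will follow if we show that, for all fixed $k$,

\[
\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} \mathbb{I}_{\{f_n(X_j) = k\}} \mathbb{I}_{\{f_n(Y_j) = k\}} = P(f(X) = k, f(Y) = k) \quad \text{a.s.} \tag{4}
\]

and

\[
\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} \mathbb{I}_{\{f_n(X_j) = k\}} \left[ \mathbb{I}_{\{f_n(X_j^*) = k\}} - \mathbb{I}_{\{f_n(Y_j) = k\}} \right] = 0 \quad \text{a.s.} \tag{5}
\]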
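
The reduction of (3) to the two limits above rests on adding and subtracting the indicator of $\{f_n(Y_j) = k\}$. Written out for a single summand, with $\mathbb{I}$ denoting an indicator function, the step reads:

```latex
\mathbb{I}_{\{f_n(X_j)=k\}}\,\mathbb{I}_{\{f_n(X_j^*)=k\}}
  = \mathbb{I}_{\{f_n(X_j)=k\}}\,\mathbb{I}_{\{f_n(Y_j)=k\}}
  + \mathbb{I}_{\{f_n(X_j)=k\}}
    \bigl[\mathbb{I}_{\{f_n(X_j^*)=k\}} - \mathbb{I}_{\{f_n(Y_j)=k\}}\bigr].
```

Averaging over $j = 1, \dots, n$, (3) follows once the average of the first term on the right converges a.s. to $P(f(X) = k, f(Y) = k)$ and the average of the second term converges a.s. to zero.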

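First we show that, by Assumption 1a, we have

\[
\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} \left[ \mathbb{I}_{\{f_n(X_j) = k\}} \mathbb{I}_{\{f_n(Y_j) = k\}} - \mathbb{I}_{\{f(X_j) = k\}} \mathbb{I}_{\{f(Y_j) = k\}} \right] = 0 \quad \text{a.s.} \tag{6}
\]

The left hand side of (6) is majorized by

\[
\frac{1}{n} \sum_{\{X_j \in C(\epsilon, r)\} \cap \{Y_j \in C(\epsilon, r)\}} \left| \mathbb{I}_{\{f_n(X_j) = k\}} \mathbb{I}_{\{f_n(Y_j) = k\}} - \mathbb{I}_{\{f(X_j) = k\}} \mathbb{I}_{\{f(Y_j) = k\}} \right|
+ \frac{1}{n} \sum_{\{X_j \notin C(\epsilon, r)\} \cup \{Y_j \notin C(\epsilon, r)\}} \left| \mathbb{I}_{\{f_n(X_j) = k\}} \mathbb{I}_{\{f_n(Y_j) = k\}} - \mathbb{I}_{\{f(X_j) = k\}} \mathbb{I}_{\{f(Y_j) = k\}} \right|.
\]

The first term converges to zero for any $\epsilon$ and $r$ by Assumption 1a,
while the second term is dominated by

\[
\frac{1}{n} \# \left\{ 1 \le j \le n : \{X_j \notin C(\epsilon, r)\} \cup \{Y_j \notin C(\epsilon, r)\} \right\},
\]

which converges a.s. to

\[
P\left( \{X \notin C(\epsilon, r)\} \cup \{Y \notin C(\epsilon, r)\} \right).
\]

Since this last limit can be made arbitrarily small by choosing $\epsilon$
and $r$ adequately, (6) holds. Finally, from the Law of Large Numbers we get

\[
\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} \mathbb{I}_{\{f(X_j) = k\}} \mathbb{I}_{\{f(Y_j) = k\}} = P(f(X) = k, f(Y) = k) \quad \text{a.s.,}
\]

which concludes the proof of (4).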

For the proof of the equation (5), the way in which the random vectors
$X_j^*$ and $Y_j$ have been defined, for a fixed subset $I$, implies that all
the $i$-coordinates of $X_j^* - Y_j$ are zero for $i \in I$, while the rest
of them (for $i \notin I$) are given by

\[
\bar{X}[i] - E(X[i]).
\]

We recall that $X[i]$ stands for the $i$-coordinate of the vector $X$, and
$\bar{X}[i] = \frac{1}{n} \sum_{j=1}^{n} X_j[i]$.

The vectors $X_j^* - Y_j$ are all the same (i.e. the difference does not
depend on $j$), and are given by

\[
(X_j^* - Y_j)[i] = \left( \bar{X}[i] - E(X[i]) \right) \mathbb{I}_{\{i \notin I\}}, \quad \text{for } j = 1, \dots, n. \tag{7}
\]

From the Law of Large Numbers we get

\[
\max_{j = 1, \dots, n} \| X_j^* - Y_j \| \to 0 \quad \text{a.s.}
\]

The proof of (5) will be complete if we show that

\[
\# \{ j : f_n(X_j^*) = k,\ f_n(Y_j) \neq k,\ f_n(X_j) = k \} / n \to 0 \quad \text{a.s.,} \tag{8}
\]

and

\[
\# \{ j : f_n(Y_j) = k,\ f_n(X_j^*) \neq k,\ f_n(X_j) = k \} / n \to 0 \quad \text{a.s.} \tag{9}
\]
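
That these two limits suffice can be checked from the elementary identity for the difference of two indicators. Here $A_j = \{f_n(X_j^*) = k\}$ and $B_j = \{f_n(Y_j) = k\}$ are labels introduced only for this remark, and $\mathbb{I}$ denotes an indicator function:

```latex
\bigl|\mathbb{I}_{A_j} - \mathbb{I}_{B_j}\bigr|
  = \mathbb{I}_{A_j \setminus B_j} + \mathbb{I}_{B_j \setminus A_j},
\qquad\text{so}\qquad
\Bigl|\frac{1}{n}\sum_{j=1}^{n} \mathbb{I}_{\{f_n(X_j)=k\}}
      \bigl[\mathbb{I}_{A_j} - \mathbb{I}_{B_j}\bigr]\Bigr|
  \le \frac{\#\{j : f_n(X_j^*)=k,\ f_n(Y_j)\neq k,\ f_n(X_j)=k\}}{n}
    + \frac{\#\{j : f_n(Y_j)=k,\ f_n(X_j^*)\neq k,\ f_n(X_j)=k\}}{n}.
```

The two terms on the right are exactly the two ratios displayed above, so the averaged difference of indicators vanishes a.s. as soon as both ratios do.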
We now define the sets $B$ and $C$ as follows:

\[
B = \left\{ \omega \in \Omega : \max_{j = 1, \dots, n} \| X_j^* - Y_j \| \to 0 \right\},
\]

\[
C_j = \left\{ \omega \in \Omega : d(X_j, \partial G_k^{(n)}) - d(X_j, \partial G_k) \to 0 \right\}
\]

and

\[
C = \bigcap_{j=1}^{\infty} C_j.
\]

By Assumption 1b we have that $P(B \cap C) = 1$. Therefore, given
$\delta > 0$ and $\omega \in B \cap C$, there exists $n_0 = n_0(\omega, \delta)$
such that, for $n \ge n_0$,

\[
\max_{j = 1, \dots, n} \| X_j^* - Y_j \| < \delta / 2.
\]

Given $\omega \in B \cap C$, we also have the following inclusions:

\[
\{ j : f_n(X_j^*) = k,\ f_n(Y_j) \neq k,\ f_n(X_j) = k \}
\subset \{ j : d(Y_j, \partial G_k^{(n)}) < \delta \}
\subset \{ j : d(Y_j, \partial G_k) < 2\delta \},
\]

which imply that the left hand side of (8) is majorized by

\[
\# \{ j : d(Y_j, \partial G_k) < 2\delta \} / n \le \frac{1}{n} \sum_{j=1}^{n} \mathbb{I}_{\{d(Y_j, \partial G_k) < 2\delta\}},
\]

which converges, as $n \to \infty$, to

\[
P(d(Y, \partial G_k) < 2\delta).
\]

Finally, from Assumption 2 we get that

\[
\lim_{\delta \to 0} P(d(Y, \partial G_k) < 2\delta) = 0,
\]

which concludes the proof of (8). The proof of (9) is completely analogous.

Proof of Theorem 2 

The proof goes along the same lines as the proof of Theorem 1. The only
difference is that now

\[
\max_{j = 1, \dots, n} \| X_j^* - Z_j \| \to 0 \quad \text{a.s.}
\]

follows from Assumption 4.